runc create/run: warn on rootless + shared pidns + no cgroup #4398

kolyshkin · 2024-09-12T01:01:11Z

A shared pid namespace means runc kill (or runc delete -f) has to
kill all container processes, not just init. To do so, it needs a cgroup
to read the PIDs from.

If there is no cgroup, processes will be leaked, and so such a
configuration is bad and should not be allowed. To keep backward
compatibility, though, let's merely warn about this for now.

Alas, the only way to know if cgroup access is available is by returning
an error from Manager.Apply. Amend fs cgroup managers to do so (systemd
doesn't need it, since v1 can't work with rootless, and cgroup v2 does
not have a special rootless case).
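To make the "needs a cgroup to read the PIDs from" point concrete, here is a minimal sketch of turning the contents of a cgroup.procs file into a PID list. It is an illustration, not runc's code: the function name parseCgroupProcs is hypothetical, and the sample data stands in for a file that would normally be read from the container's cgroup directory.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCgroupProcs (hypothetical helper) parses the contents of a
// cgroup.procs file, which lists one decimal PID per line.
func parseCgroupProcs(data string) ([]int, error) {
	var pids []int
	for _, f := range strings.Fields(data) {
		pid, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("bad pid %q in cgroup.procs: %w", f, err)
		}
		pids = append(pids, pid)
	}
	return pids, nil
}

func main() {
	// In real use, data would come from something like
	// os.ReadFile("<cgroup path>/cgroup.procs"); without access to the
	// cgroup there is no such list to read.
	pids, err := parseCgroupProcs("101\n102\n250\n")
	if err != nil {
		panic(err)
	}
	fmt.Println(pids)
}
```

With a shared pid namespace, every PID in that list has to be signalled; when the cgroup is not available, only init is known and the rest would be leaked.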

Related to #4394, #4395.

AkihiroSuda · 2024-09-12T01:06:41Z

libcontainer/configs/validate/validator.go

@@ -42,6 +42,7 @@ func Validate(config *configs.Config) error {
 	// Relaxed validation rules for backward compatibility
 	warns := []check{
 		mountsWarn,
+		rootlessSharedPidns, // TODO: make it an error in runc 1.3.


No need to error out if we implement walking the process tree

> No need to error out if we implement walking the process tree

To me it looks neither possible nor desirable.

Not possible because

- the init process might be gone already (with some other processes still running);
- walking the tree requires freezing all the processes, as otherwise the walker will be racing with forks.

Not desirable because

- the code will probably be very slow and resource hungry, thanks to the text-based nature of /proc;
- cgroup v1 is going to be obsolete (one day it will; fingers crossed).

Also, the only possible way to implement this process tree walking nicely (i.e. not slow and resource hungry) would be via ebpf (which has access to kernel-internal data structures), but the kernels supporting it are probably running cgroup v2 already. Even with ebpf, other issues cited above remain.
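To illustrate the "text-based nature of /proc" concern, here is a rough sketch of the per-process parsing a tree walker would need: extracting the parent PID from one /proc/<pid>/stat line. The helper name parsePPid is hypothetical and the sample line is made up; the field order (pid, comm, state, ppid) follows proc(5).

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parsePPid (hypothetical helper) extracts the parent PID from the
// contents of /proc/<pid>/stat. The comm field is wrapped in parentheses
// and may itself contain spaces or parentheses, so parsing starts after
// the last ')'.
func parsePPid(stat string) (int, error) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 {
		return 0, errors.New("malformed stat line: no ')'")
	}
	// After comm come: state, ppid, pgrp, ...
	fields := strings.Fields(stat[i+1:])
	if len(fields) < 2 {
		return 0, errors.New("malformed stat line: too few fields")
	}
	return strconv.Atoi(fields[1])
}

func main() {
	ppid, err := parsePPid("1234 (my (odd) proc) S 1 1234 1234 0 -1")
	if err != nil {
		panic(err)
	}
	fmt.Println("ppid:", ppid)
}
```

A walker would have to open and parse one such file for every process on the system, and any answer can already be stale by the time it is used, which is the fork race mentioned above.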

kolyshkin · 2024-09-12T06:15:03Z

@AkihiroSuda I think we need more than was done in #4395.

First, I am a bit puzzled why we see

level=warning msg="failed to kill all processes, possibly due to lack of cgroup (Hint: enable cgroup v2 delegation)" error="container not running"

from runc create.

Second, I guess we need to add a special case to (*Container).Signal to not log the error when cgroup v2 + systemd >= v245 is used, because in that case we only need to kill the initial process and systemd takes care of the rest (i.e. it kills all processes in a cgroup when the initial process is gone).

WDYT?

kolyshkin · 2024-09-12T06:24:27Z

Maybe we'll also need to call c.cgroupManager.Destroy() from (*Container).Signal when systemd manager is used, as a last resort. Currently this is only done during runc delete -f but not during runc kill -9.

lifubang · 2024-09-12T22:58:15Z

> First, I am a bit puzzled why we see

Mainly because of ‘delete -f’ in teardown?

runc/tests/integration/helpers.bash

Lines 726 to 737 in 4833305

    
           function teardown_bundle() { 
        
           	[ ! -v ROOT ] && return 0 # nothing to teardown 
        
           	cd "$INTEGRATION_ROOT" || return 
        
           	teardown_recvtty 
        
           	local ct 
        
           	for ct in $(__runc list -q); do 
        
           		__runc delete -f "$ct" 
        
           	done 
        
           	rm -rf "$ROOT" 
        
           	remove_parent 
        
           }

kolyshkin · 2024-09-12T23:56:42Z

> > First, I am a bit puzzled why we see
>
> Mainly because of ‘delete -f’ in teardown?

You are right; I was sure messages from __runc (which is part of teardown) are not logged, and yet they are.

kolyshkin · 2024-09-13T00:45:58Z

OK, the validation check can't work correctly, as it does not know whether the cgroup is actually accessible. We need to log a warning later, when manager.Apply fails. Reworked this PR to do just that.

lifubang · 2024-09-14T14:44:09Z

libcontainer/process_linux.go

@@ -580,7 +580,18 @@ func (p *initProcess) start() (retErr error) {
 	// cgroup. We don't need to worry about not doing this and not being root
 	// because we'd be using the rootless cgroup manager in that case.
 	if err := p.manager.Apply(p.pid()); err != nil {
-		return fmt.Errorf("unable to apply cgroup configuration: %w", err)
+		if errors.Is(err, cgroups.ErrRootless) {


For the purpose of warning, I think this PR LGTM.
Furthermore, I think we should serialize this to state.json, for example as a field named noCgroup; it would be useful for runc kill.
Please see: #4395 (comment)

Yes, I was thinking about it, too, but let's do one improvement at a time, shall we?

So, we have been trying to cut 1.2.0 for some time now, and this PR is a result of a recent development in the area (a regression described in #4394 and fixed by #4395). As I noted in #4394 (comment), I'm OK with the fix in #4395, but it would be nice to also introduce a warning; this is what this PR does.

I think we can introduce noCgroup in runc 1.3. Feel free to open an issue about it so we won't forget.

rata

LGTM, thanks!

rata · 2024-09-16T10:06:42Z

libcontainer/process_linux.go

+		if errors.Is(err, cgroups.ErrRootless) {
+			// ErrRootless is to be ignored except when the
+			// container doesn't have private pidns.
+			if !p.config.Config.Namespaces.IsPrivate(configs.NEWPID) {


Any reason to not put this condition in the previous if? Do you think it is more readable like this?

I did that initially, but rolled it back later since changing the code like this complicates the review (the whole block changes instead of just one line).

I will add a separate commit that does it.

My bad; I mixed this up with something else. Updated; PTAL @rata

Ughm, I just broke everything.

This is two if statements (and an else) because we want to

- ignore ErrRootless;
- except when there's no private pidns, in which case we want a warning;
- return all other errors as is.

No way to do that with a single if.

Makes sense, I tried to collapse it and I'm not sure it is more readable than this.

It's not a question of readability. There are three different conditions, can't do it with a single if.
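The three outcomes discussed in this thread can be sketched as a small standalone function. This is an illustration under assumptions, not the code from the PR: errRootless stands in for cgroups.ErrRootless, privatePidns stands in for the Namespaces.IsPrivate(configs.NEWPID) check, and checkApplyErr and errOther are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// errRootless is a stand-in for cgroups.ErrRootless.
var errRootless = errors.New("rootless container: no cgroup access")

// errOther is an arbitrary unrelated failure, used for the demo only.
var errOther = errors.New("some other cgroup failure")

// checkApplyErr (hypothetical helper) maps the error returned by Apply to
// one of three outcomes:
//   - errRootless with a private pidns: ignored (no warning, no error);
//   - errRootless with a shared pidns: ignored, but a warning is wanted;
//   - any other error: returned to the caller, wrapped.
func checkApplyErr(err error, privatePidns bool) (warn bool, retErr error) {
	if err == nil {
		return false, nil
	}
	if errors.Is(err, errRootless) {
		return !privatePidns, nil
	}
	return false, fmt.Errorf("unable to apply cgroup configuration: %w", err)
}

func main() {
	for _, tc := range []struct {
		err     error
		private bool
	}{
		{errRootless, true},
		{errRootless, false},
		{errOther, false},
	} {
		warn, err := checkApplyErr(tc.err, tc.private)
		fmt.Printf("private pidns=%v warn=%v err=%v\n", tc.private, warn, err)
	}
}
```

Written this way, the nesting mirrors the reasoning above: the outer check separates ErrRootless from everything else, and the inner one only decides whether a warning is due.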

kolyshkin · 2024-09-18T02:18:06Z

@AkihiroSuda PTAL

lifubang · 2024-09-18T09:58:38Z

libcontainer/cgroups/fs/fs.go

@@ -129,14 +129,15 @@ func (m *Manager) Apply(pid int) (err error) {
 			// later by Set, which fails with a friendly error (see
 			// if path == "" in Set).
 			if isIgnorableError(c.Rootless, err) && c.Path == "" {
+				retErr = cgroups.ErrRootless


Do you think this is too strict? Maybe we should add a condition here:

if name == "devices" { retErr = cgroups.ErrRootless }

It doesn't really matter here because there can't be a situation where the devices cgroup can't be created while the others can.
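For readers unfamiliar with the pattern in this hunk, here is a simplified sketch of "keep going on ignorable errors, but remember a sentinel to return at the end". It assumes a much reduced model of the fs manager: applyAll, joinFunc, errPermission, and the path strings are hypothetical, errRootless stands in for cgroups.ErrRootless, and the real condition (isIgnorableError plus an empty c.Path) is reduced to a rootless flag and a single permission error.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// errRootless is a stand-in for cgroups.ErrRootless.
	errRootless = errors.New("rootless container: no cgroup access")
	// errPermission models the kind of error that is ignorable when rootless.
	errPermission = errors.New("permission denied")
)

// joinFunc (hypothetical) places pid into the cgroup at path.
type joinFunc func(path string, pid int) error

// applyAll (hypothetical) tries to join every subsystem path. Ignorable
// failures drop the path from the map and are remembered as errRootless;
// any other failure aborts immediately.
func applyAll(paths map[string]string, rootless bool, pid int, join joinFunc) (retErr error) {
	for name, path := range paths {
		if err := join(path, pid); err != nil {
			if rootless && errors.Is(err, errPermission) {
				retErr = errRootless
				delete(paths, name) // deleting during range is safe in Go
				continue
			}
			return err
		}
	}
	return retErr
}

func main() {
	paths := map[string]string{
		"devices": "/example/devices/ct",
		"memory":  "/example/memory/ct",
	}
	deny := func(string, int) error { return errPermission }
	err := applyAll(paths, true, 42, deny)
	fmt.Println(err, len(paths))
}
```

In this sketch the sentinel is set whenever any subsystem is skipped, regardless of its name, which matches the reply above that singling out devices would not change the outcome.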

lifubang · 2024-09-18T10:04:45Z

libcontainer/process_linux.go

+			// the container doesn't have private pidns.
+			if !p.config.Config.Namespaces.IsPrivate(configs.NEWPID) {
+				// TODO: make this an error in runc 1.3.
+				logrus.Warn("Creating a rootless container with no cgroup and no private pid namespace. " +


How about s/no cgroup/no devices cgroup/ ?

I think such a message would have a negative effect, as it would be more confusing to a user (especially in the cgroup v2 case).

AkihiroSuda reviewed Sep 12, 2024

kolyshkin force-pushed the no-shared-pidns branch from e62e4be to 0980635 Compare September 12, 2024 01:36

kolyshkin requested a review from AkihiroSuda September 12, 2024 02:22

kolyshkin force-pushed the no-shared-pidns branch from 0980635 to 2eb28b5 Compare September 12, 2024 05:52

kolyshkin changed the title ~~runc create/run: warn on cgroup v1 + shared pidns + rootless~~ runc create/run: warn on shared pidns + rootless + no cgroup delegation Sep 12, 2024

kolyshkin force-pushed the no-shared-pidns branch from 15b4383 to 4833305 Compare September 12, 2024 22:51

kolyshkin force-pushed the no-shared-pidns branch from 4833305 to c2b04a3 Compare September 12, 2024 23:36

kolyshkin force-pushed the no-shared-pidns branch from c2b04a3 to fd38d2d Compare September 13, 2024 00:45

kolyshkin force-pushed the no-shared-pidns branch 3 times, most recently from 2c48e0a to 7168d59 Compare September 13, 2024 17:41

kolyshkin changed the title ~~runc create/run: warn on shared pidns + rootless + no cgroup delegation~~ runc create/run: warn on rootless + shared pidns + no cgroup Sep 13, 2024

kolyshkin force-pushed the no-shared-pidns branch 2 times, most recently from e930bf9 to 5c864e9 Compare September 13, 2024 18:16

kolyshkin marked this pull request as ready for review September 13, 2024 18:16

kolyshkin requested review from rata, lifubang and cyphar September 13, 2024 18:16

kolyshkin force-pushed the no-shared-pidns branch from 5c864e9 to 805f27c Compare September 13, 2024 19:51

lifubang reviewed Sep 14, 2024

rata approved these changes Sep 16, 2024

kolyshkin force-pushed the no-shared-pidns branch from 805f27c to 3ff4bd3 Compare September 18, 2024 02:31

kolyshkin added the area/rootless label Sep 18, 2024

kolyshkin force-pushed the no-shared-pidns branch from 3ff4bd3 to bb6c848 Compare September 18, 2024 05:41

kolyshkin force-pushed the no-shared-pidns branch from bb6c848 to 3e79e53 Compare September 18, 2024 05:45

kolyshkin added 3 commits September 17, 2024 22:49

libct: use Namespaces.IsPrivate more

b1449fd

In these cases, this is exactly what we want to find out. Slightly improves performance and readability. Signed-off-by: Kir Kolyshkin <[email protected]>

tests/int: log when teardown starts

21c6116

This aids in failed test analysis by allowing to distinguish the output of various commands being run as part of the test case from the output of teardown command like runc delete. Signed-off-by: Kir Kolyshkin <[email protected]>

kolyshkin force-pushed the no-shared-pidns branch from 3e79e53 to 30f8f51 Compare September 18, 2024 05:49

lifubang reviewed Sep 18, 2024

AkihiroSuda approved these changes Sep 24, 2024

kolyshkin merged commit a1acfcf into opencontainers:main Sep 24, 2024
42 checks passed

github-actions bot mentioned this pull request Oct 27, 2024

Bump runc from v1.1.13 to v1.2.0 kokyhm/kubespray#60

Open
